Finding good policies in average-reward Markov Decision Processes without prior knowledge
We revisit the identification of an \varepsilon -optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, D, and the optimal bias span, H, which satisfy H \leq D . Prior work has studied the complexity of \varepsilon -optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with D \simeq H for which the sample complexity to output an \varepsilon -optimal policy is \Omega(SAD/\varepsilon^2), where S and A are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order SAH/\varepsilon^2 has been proposed, but it requires the knowledge of H .
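The optimal bias span H referenced above can be computed for a small MDP by relative value iteration: iterate the Bellman optimality operator while subtracting the value of a reference state, and take the span of the limiting bias vector. The toy 2-state, 2-action MDP below is a hypothetical example (not from the paper), intended only to make the quantity H = sp(h^*) concrete:

```python
import numpy as np

# Toy 2-state, 2-action average-reward MDP (hypothetical, for illustration only).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
S, A = 2, 2
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])

# Relative value iteration: h <- T(h) - T(h)[0], where T is the Bellman
# optimality operator; subtracting a reference state keeps h bounded.
h = np.zeros(S)
for _ in range(10_000):
    Th = np.max(R + (P @ h), axis=1)   # (P @ h)[s, a] = sum_s' P[s,a,s'] h[s']
    h_new = Th - Th[0]
    if np.max(np.abs(h_new - h)) < 1e-10:
        break
    h = h_new

H = h.max() - h.min()   # optimal bias span sp(h*)
print(f"optimal bias span H = {H:.4f}")
```

For communicating MDPs like this toy example, the iteration converges and H lower-bounds the diameter D, matching the inequality H \leq D in the abstract.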
Privacy Amplification via Compression: Achieving the Optimal Privacy-Accuracy-Communication Trade-off in Distributed Mean Estimation
Privacy and communication constraints are two major bottlenecks in federated learning (FL) and analytics (FA). We study the optimal accuracy of mean and frequency estimation (canonical models for FL and FA, respectively) under joint communication and (\varepsilon, \delta) -differential privacy (DP) constraints. We consider both the central and the multi-message shuffled DP models. Without compression, each client needs O(d) bits and O\left(\log d\right) bits for the mean and frequency estimation problems, respectively (where d corresponds to the number of trainable parameters in FL or the domain size in FA), meaning that we can get significant savings in the regime n \min\left(\varepsilon, \varepsilon^2\right) = o(d), which is often the relevant regime in practice. In both cases, each client communicates only partial information about its sample and we show that privacy is amplified by randomly selecting the part contributed by each client.
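The mechanism described in the last sentence, each client revealing only a random part of its vector, can be sketched as follows. This is a minimal toy version of coordinate subsampling with local Gaussian noise, not the paper's actual scheme; the sizes n, d, k and the noise scale are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 32, 4   # clients, dimension, coordinates sent per client (toy sizes)

X = rng.normal(size=(n, d))   # each row: one client's private vector

est = np.zeros(d)
counts = np.zeros(d)
for x in X:
    # Each client reveals only k randomly chosen coordinates (compression),
    # perturbed with Gaussian noise (toy stand-in for a DP mechanism).
    idx = rng.choice(d, size=k, replace=False)
    est[idx] += x[idx] + rng.normal(scale=0.5, size=k)
    counts[idx] += 1

mean_hat = est / np.maximum(counts, 1)   # per-coordinate average of received values
true_mean = X.mean(axis=0)
print("max abs error:", np.abs(mean_hat - true_mean).max())
```

Intuitively, each coordinate of the aggregate only depends on the random k/d fraction of clients that selected it, which is the source of the privacy amplification the abstract refers to.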
Byzantine Stochastic Gradient Descent
Alistarh, Dan, Allen-Zhu, Zeyuan, Li, Jerry
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and may behave adversarially. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity. Papers published at the Neural Information Processing Systems Conference.
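The setting above can be illustrated with a minimal sketch of distributed SGD under an \alpha-fraction of Byzantine machines. Note the coordinate-wise median used here is a generic robust-aggregation stand-in, not the paper's own filtering procedure, and the objective, noise model, and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, alpha = 20, 5, 0.2          # machines, dimension, Byzantine fraction (toy values)
n_byz = int(alpha * m)            # number of adversarial machines
w_true = np.ones(d)               # minimizer of the toy objective

def stoch_grad(w, rng):
    # Stochastic gradient of f(w) = 0.5 * ||w - w_true||^2 with Gaussian noise.
    return (w - w_true) + rng.normal(scale=0.1, size=w.shape)

w = np.zeros(d)
lr = 0.5
for t in range(200):
    G = np.stack([stoch_grad(w, rng) for _ in range(m)])
    G[:n_byz] = 100.0                  # Byzantine machines report arbitrary garbage
    g = np.median(G, axis=0)           # robust aggregation (stand-in for the paper's filter)
    w -= lr * g

print("distance to optimum:", np.linalg.norm(w - w_true))
```

With honest-majority aggregation the iterates still approach the optimum despite the corrupted reports, whereas naively averaging all m gradients would be dominated by the Byzantine machines.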